true

1. Assignment requirements

Mini-Challenge 1

Mini-Challenge 1 looks at the relationships and conditions that led up to the kidnapping. As an analyst, you have a set of current and historical news reports at your disposal, as well as resumes of numerous GAStech employees and email headers from two weeks of internal GAStech company email. Can you identify the complex relationships among all of these people and organizations?

Background

Note: This scenario and all the people, places, groups, technologies, contained therein are fictitious. Any resemblance to real people, places, groups, or technologies is purely coincidental.

In the roughly twenty years that Tethys-based GAStech has been operating a natural gas production site in the island country of Kronos, it has produced remarkable profits and developed strong relationships with the government of Kronos. However, GAStech has not been as successful in demonstrating environmental stewardship.

In January, 2014, the leaders of GAStech are celebrating their new-found fortune as a result of the initial public offering of their very successful company. In the midst of this celebration, several employees of GAStech go missing. An organization known as the Protectors of Kronos (POK) is suspected in the disappearances.

It is January 21, 2014, and as an expert in visual analytics, you have been tasked with helping people understand the complex relationships among people and organizations that may have contributed to these events.

Tasks and Questions:

Mini-Challenge 1 looks at the relationships and conditions that led up to the kidnapping. As an analyst, you have a set of current and historical news reports at your disposal, as well as resumes of numerous GAStech employees and email headers from two weeks of internal GAStech company email. Can you identify the complex relationships among all of these people and organizations?

Use visual analytics to analyze the available data and develop responses to the questions below. In addition, prepare a video that shows how you used visual analytics to solve the challenge.

  1. Characterize the news data sources provided. Which are primary sources and which are derivative sources? What are the relationships between the primary and derivative sources? Please limit your answer to 8 images and 300 words.

  2. Characterize any biases you identify in these news sources, with respect to their representation of specific people, places, and events. Give examples. Please limit your answer to 6 images and 500 words.

  3. Given the data sources provided, use visual analytics to identify potential official and unofficial relationships among GASTech, POK, the APA, and Government. Include both personal relationships and shared goals and objectives. Provide evidence for these relationships. Please limit your answer to 6 images and 400 words.

2. data preparation

packages = c('igraph','tidygraph',
             'ggraph','visNetwork',
             'lubridate','clock',
             'tidyverse',"tm",
             'tidytext', 
             'widyr', 'wordcloud',
             'DT', 'ggwordcloud', 
             'textplot','hms',
             'tidygraph', 'ggraph',
             'igraph','flipTime')
for(p in packages){
  if(!require(p,character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

read in data source

GAStech_nodes <- read_csv("data source/MC1/EmployeeRecords.csv")
GAStech_edges <- read_csv("data source/MC1/email headers.csv")
GAStech_nodes$Name <- paste(GAStech_nodes$FirstName,GAStech_nodes$LastName)
GAStech_nodes$Name <- gsub(" ",".",GAStech_nodes$Name)
GAStech_nodes$id <- c(1:nrow(GAStech_nodes))
GAStech_nodes$military[GAStech_nodes$MilitaryDischargeDate != "NA"] <- "YES"
GAStech_nodes$military[is.na(GAStech_nodes$military)] <- "NO"

change data type

GAStech_nodes$BirthDate = mdy(GAStech_nodes$BirthDate)
GAStech_nodes$CitizenshipStartDate = mdy(GAStech_nodes$CitizenshipStartDate)
GAStech_nodes$PassportIssueDate = mdy(GAStech_nodes$PassportIssueDate)
GAStech_nodes$PassportExpirationDate = mdy(GAStech_nodes$PassportExpirationDate)
GAStech_nodes$CurrentEmploymentStartDate = mdy(GAStech_nodes$CurrentEmploymentStartDate)
GAStech_nodes$MilitaryDischargeDate = mdy(GAStech_nodes$MilitaryDischargeDate)

Switch column orders

GAStech_nodes <- GAStech_nodes[, c(20, 19, 15, 12, 13, 4:8, 21, 1:3, 9:11, 14, 16:18)]
GAStech_nodes$Name <- tolower(GAStech_nodes$Name)
GAStech_nodes_new <- GAStech_nodes
GAStech_nodes_new$Name[GAStech_nodes_new$Name == "sten.sanjorge.jr."] <- "sten.sanjorge.jr"

to process gestech edges data & data cleaning

GAStech_edges_new <- GAStech_edges %>% 
  separate_rows(To)
GAStech_edges_new$From <- removeWords(GAStech_edges_new$From, "@gastech.com.kronos")
GAStech_edges_new$Date <- mdy_hm(GAStech_edges_new$Date)
GAStech_edges_new$Weekday <-  wday(GAStech_edges_new$Date,
                                 label = TRUE,
                                 abbr = FALSE)

GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$From != "gastech.com.kronos",]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To != "gastech.com.kronos",]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$From != "@gastech.com.tethys",]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To != "@gastech.com.tethys",]

GAStech_edges_new$From <- tolower(GAStech_edges_new$From)
GAStech_edges_new$To <- tolower(GAStech_edges_new$To)
GAStech_edges_new$From[GAStech_edges_new$From == "sten.sanjorge jr."] <- "sten.sanjorge.jr"

GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To != GAStech_edges_new$From,]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$Weekday != "NA",]

vc <- unique(GAStech_nodes_new$Name)
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$From %in% vc,]
GAStech_edges_new <- GAStech_edges_new[GAStech_edges_new$To %in% vc,]
nodes_ref <- GAStech_nodes_new[,1:2]

colnames(nodes_ref) <- c("Source", "From")
GAStech_edges_new1 <- merge(GAStech_edges_new, nodes_ref[, c("Source", "From")], by="From")
colnames(nodes_ref) <- c("Target", "To")
GAStech_edges_new1 <- merge(GAStech_edges_new1, nodes_ref[, c("Target", "To")], by="To")

GAStech_edges_aggregated <- GAStech_edges_new1 %>%
  group_by(Source, Target, Weekday) %>%
    summarise(Weight = n()) %>%
  filter(Weight > 1) %>%
  ungroup()

GAStech_graph <- tbl_graph(nodes = GAStech_nodes_new,
                           edges = GAStech_edges_aggregated, 
                           directed = TRUE)

GAStech_graph %>%
  activate(edges) %>%
  arrange(desc(Weight))
## # A tbl_graph: 54 nodes and 1813 edges
## #
## # A directed multigraph with 2 components
## #
## # Edge Data: 1,813 x 4 (active)
##    from    to Weekday Weight
##   <int> <int> <ord>    <int>
## 1    40    41 Tuesday     23
## 2    40    43 Tuesday     19
## 3    41    43 Tuesday     17
## 4    42    40 Tuesday     17
## 5    41    40 Tuesday     16
## 6    42    41 Tuesday     15
## # ... with 1,807 more rows
## #
## # Node Data: 54 x 21
##      id Name  EmailAddress CurrentEmployme~ CurrentEmployme~ BirthCountry Gender
##   <int> <chr> <chr>        <chr>            <chr>            <chr>        <chr> 
## 1     1 mat.~ Mat.Bramar@~ Administration   Assistant to CEO Tethys       Male  
## 2     2 anda~ Anda.Ribera~ Administration   Assistant to CFO Tethys       Female
## 3     3 rach~ Rachel.Pant~ Administration   Assistant to CIO Tethys       Female
## # ... with 51 more rows, and 14 more variables: CitizenshipCountry <chr>,
## #   CitizenshipBasis <chr>, CitizenshipStartDate <date>, military <chr>,
## #   LastName <chr>, FirstName <chr>, BirthDate <date>, PassportCountry <chr>,
## #   PassportIssueDate <date>, PassportExpirationDate <date>,
## #   CurrentEmploymentStartDate <date>, MilitaryServiceBranch <chr>,
## #   MilitaryDischargeType <chr>, MilitaryDischargeDate <date>

text data preparation

all_news <- "data source/MC1/News Articles/"

read_folder <- function(infolder) {
  tibble(file = dir(infolder, 
                    full.names = TRUE)) %>%
    mutate(text = map(file, 
                      read_lines)) %>%
    transmute(id = basename(file), 
              text) %>%
    unnest(text)
}

raw_text <- tibble(folder = 
                     dir(all_news, 
                         full.names = TRUE)) %>%
  mutate(folder_out = map(folder, 
                          read_folder)) %>%
  unnest(cols = c(folder_out)) %>%
  transmute(newsgroup = basename(folder), 
            id, text)
write_rds(raw_text, "data source/MC1/rds/all_news.rds")

raw_text_backup <- raw_text

clean text and prepare tables

raw_text1<-raw_text[!(raw_text$text==""|raw_text$text=="<< Continue reading the main story >>"),]
raw_text1<-raw_text1[!(raw_text1$text==" "|raw_text1$text=="<< to continue reading main history >>"),]
raw_text1<-raw_text1[!(raw_text1$text=="<< to continue reading main history >>"),]
cleaned_text <- raw_text1 %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
         | text == "",
         !str_detect(text, 
                     "writes(:|\\.\\.\\.)$"),
         !str_detect(text, 
                     "^In article <"),
         !str_detect(text, 
                     "<<"),
         !str_detect(text,
                     "To continue reading main history"),
         !str_detect(text,
                     " to continue reading main history "),
         !str_detect(text,
                     "continue to read the principal history"),
         !str_detect(text,
                     "SOURCE:")
  )
time_text <- raw_text1 %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
         | text == "",
         str_detect(text, 
                     "PUBLISHED:")
  )
time_text1 <- time_text
time_text1$text <- removeWords(time_text1$text, "PUBLISHED:")
time_text1$text <- removeWords(time_text1$text, "PUBLISHED: ")
time_text1$text1 <- dmy(time_text$text)
time_text1$text1 <- format(as.Date(time_text1$text1),'%d/%m/%Y')
time_text1$text2 <- ymd(time_text$text)
time_text1$text2 <- format(as.Date(time_text1$text2),'%d/%m/%Y')
time_text1$Time <- ifelse(is.na(time_text1$text1),
                           time_text1$text2,time_text1$text1)
time_text1 <- time_text1[,c(1,2,6)]
time_text1 <- time_text1[complete.cases(time_text1), ]
title_text <- raw_text1 %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
         | text == "",
         str_detect(text, 
                     "TITLE:")
  )
title_text$text <- removeWords(title_text$text, "TITLE: ")
colnames(title_text) <- c("News Source", "id", "Title")
cleaned_text <- cleaned_text %>%
  filter(!str_detect(text, 
                     "PUBLISHED:")
  )
## display the tale with time
title_new <- merge(title_text, time_text1[, c("id", "Time")], by="id")

usenet_words <- cleaned_text %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)

words_by_newsgroup <- usenet_words %>%
  count(newsgroup, word, sort = TRUE) %>%
  ungroup()

3. Assignment answering

1) Characterize the news data sources provided. Which are primary sources and which are derivative sources? What are the relationships between the primary and derivative sources? Please limit your answer to 8 images and 300 words.

The similarity can be visualize by checking the most frequent words in each of the data source. In order to find out the correlation between different news sources, we firstly compute the tf-idf within each data source.

Can see directly from the top 12 highest tf-idf words of each group that, some groups share similar keywords. For example “candy”, appears in four sources:“News Online Today”, “The Continent”, “International Times”, “World Source”. From the DT table below we can type in “candy” and see the relative tf-idf values. Also, each source of news have certain preference of focus,therefore, for each topic the news, the primary source are more likely comes from the source with higher tf-idf scores.

word cloud demonstration off all news & EDA for different News Media

wordcloud(words_by_newsgroup$word,
          words_by_newsgroup$n,
          max.words = 300)

raw_text %>%
  group_by(newsgroup) %>%
  summarize(messages = n_distinct(id)) %>%
  ggplot(aes(messages,newsgroup)) +
  geom_col(fill = "lightblue") +
  labs(y = NULL)

Computing tf-idf within newsgroups

tf_idf <- words_by_newsgroup %>%
  bind_tf_idf(word, newsgroup, n) %>%
  arrange(desc(tf_idf))

Visualising tf-idf as interactive table

DT::datatable(tf_idf, filter = 'top') %>% 
  formatRound(columns = c('tf', 'idf', 
                          'tf_idf'), 
              digits = 3) %>%
  formatStyle(0, 
              target = 'row', 
              lineHeight='50%')

Visualising tf-idf within newsgroups

tf_idf %>%
  group_by(newsgroup) %>%
  filter(newsgroup == "Centrum Sentinel"|newsgroup == "Modern Rubicon"|
         newsgroup == "Tethys News"|newsgroup == "News Online Today"|
         newsgroup == "Kronos Star"|newsgroup == "Central Bulletin") %>%
  slice_max(tf_idf, 
            n = 12) %>%
  ungroup() %>%
  mutate(word = reorder(word, 
                        tf_idf)) %>%
  ggplot(aes(tf_idf, 
             word, 
             fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, 
             scales = "free") +
  labs(x = "tf-idf", 
       y = NULL)

tf_idf %>%
  group_by(newsgroup) %>%
  filter(newsgroup == "The Guide"|newsgroup == "Worldwise"|
         newsgroup == "Homeland Illumination"|newsgroup == "Athena Speaks"|
         newsgroup == "The Explainer"|newsgroup == "Daily Pegasus") %>%
  slice_max(tf_idf, 
            n = 12) %>%
  ungroup() %>%
  mutate(word = reorder(word, 
                        tf_idf)) %>%
  ggplot(aes(tf_idf, 
             word, 
             fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, 
             scales = "free") +
  labs(x = "tf-idf", 
       y = NULL)

tf_idf %>%
  group_by(newsgroup) %>%
  filter(newsgroup == "The Orb"|newsgroup == "The Abila Post"|
         newsgroup == "World Source"|newsgroup == "News Desk"|
         newsgroup == "Everyday News"|newsgroup == "The Continent") %>%
  slice_max(tf_idf, 
            n = 12) %>%
  ungroup() %>%
  mutate(word = reorder(word, 
                        tf_idf)) %>%
  ggplot(aes(tf_idf, 
             word, 
             fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, 
             scales = "free") +
  labs(x = "tf-idf", 
       y = NULL)

tf_idf %>%
  group_by(newsgroup) %>%
  filter(newsgroup == "The General Post"|newsgroup == "The Tulip"|
         newsgroup == "The Wrap"|newsgroup == "World Journal"|
         newsgroup == "The Truth"|newsgroup == "The Light of Truth") %>%
  slice_max(tf_idf, 
            n = 12) %>%
  ungroup() %>%
  mutate(word = reorder(word, 
                        tf_idf)) %>%
  ggplot(aes(tf_idf, 
             word, 
             fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, 
             scales = "free") +
  labs(x = "tf-idf", 
       y = NULL)

tf_idf %>%
  group_by(newsgroup) %>%
  filter(newsgroup == "International Times"|newsgroup == "Who What News"|
         newsgroup == "The World"|newsgroup == "All News Today"|
         newsgroup == "International News") %>%
  slice_max(tf_idf, 
            n = 12) %>%
  ungroup() %>%
  mutate(word = reorder(word, 
                        tf_idf)) %>%
  ggplot(aes(tf_idf, 
             word, 
             fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, 
             scales = "free") +
  labs(x = "tf-idf", 
       y = NULL)

Also, word clouds are presented to better computing the main topic from each of media.

Word-cloud for each of the News Media

set.seed(1234)
words_by_newsgroup %>%
  filter(n > 15) %>%
  ggplot(aes(label = word,
             size = n)) +
  geom_text_wordcloud() +
  theme_minimal() +
  facet_wrap(~newsgroup)

The following DT table allows us to select text regarding to time, see the News sources and the title of News, can scroll and type in key words to select intended news and find out the primary source and secondary source reagrding to key words.

title-time DT table
DT::datatable(title_new)

Counting and correlating and See the correlation between each of news sources

newsgroup_cors <- words_by_newsgroup %>%
  pairwise_cor(newsgroup, 
               word, 
               n, 
               sort = TRUE)

The following DT table present the correlations between each of the News Media(New Source), and higher the correlations are, the more likely that two news media are sharing similar news are focusing on similar area of news. Also, they may be Primary source and Secondary source for each other, when combining with the title-time DT table can demonstrate better output.

Type in the intended news media and see the results

newsgroup_cors1 <- newsgroup_cors
DT::datatable(newsgroup_cors, filter = 'top') %>% 
  formatRound(columns = c('item1', 'item2', 
                          'correlation'), 
              digits = 3) %>%
  formatStyle(0, 
              target = 'row', 
              lineHeight='50%')

By setting and increasing the threshold of correlation, a step-by-step community detection using visualization if performed. And can see from the graph that for a threshold of 0.7, two main communities can be distinguished, and medias from the same community are more correlated than other medias from other community.

Visualising correlation as a network

set.seed(2017)
newsgroup_cors %>%
  filter(correlation > .7) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, 
                     width = correlation)) +
  geom_node_point(size = 6, 
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

Three communities are separated for a threshold of 0.75.

set.seed(2017)
newsgroup_cors %>%
  filter(correlation > .75) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, 
                     width = correlation)) +
  geom_node_point(size = 6, 
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

And five communities are separated for a threshold of 0.8.

set.seed(2017)
newsgroup_cors %>%
  filter(correlation > .8) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, 
                     width = correlation)) +
  geom_node_point(size = 6, 
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

2) Characterize any biases you identify in these news sources, with respect to their representation of specific people, places, and events. Give examples. Please limit your answer to 6 images and 500 words.

All_text <- merge(cleaned_text, title_text[,c("id","Title")], by="id")
All_text <- merge(All_text, time_text1[, c("id", "Time")], by="id")
All_text$id <- iconv(All_text$id, to = "utf-8")
All_text$newsgroup <- iconv(All_text$newsgroup, to = "utf-8")
All_text$text <- iconv(All_text$text, to = "utf-8")
All_text$Title <- iconv(All_text$Title, to = "utf-8")
All_text$Time <- iconv(All_text$Time, to = "utf-8")
DT::datatable(All_text)
  1. One of the bias can be found is one emplyee of the company was related in the news that related to the chaos and conflict.

Also, Juliana Vann,a young girl died due to the environment pollution, share the same family name with two of the male employees of Gastech.

Notice the high frequency in News that the name of Juliana Vann appears.

Also, one of the emplyee is the member of POK as well as the older brother of Juliana Vann, as is shown in the picture attached.

And it is noticeable that high frequency of the family name “Vann” appears in the news.

A word “bend” constantly appears in the News, this could be related to Tiskele Bend fields, which is the main focus of POK targeting to environment protection, as well as the main business and source of income for GAStech.

3) Given the data sources provided, use visual analytics to identify potential official and unofficial relationships among GASTech, POK, the APA, and Government. Include both personal relationships and shared goals and objectives. Provide evidence for these relationships. Please limit your answer to 6 images and 400 words.

Plotting network graph

GAStech_nodes_new <- GAStech_nodes_new %>%
  rename(group = CurrentEmploymentType)
GAStech_nodes_new$title <- GAStech_nodes_new$Name
GAStech_nodes_new$label <- GAStech_nodes_new$Name
GAStech_edges_aggregated1 <- GAStech_edges_new1 %>%
  left_join(GAStech_nodes_new, by = c("From" = "Name")) %>%
  rename(from = id) %>%
  left_join(GAStech_nodes_new, by = c("To" = "Name")) %>%
  rename(to = id) %>%
  ungroup()
GAStech_edges_aggregated2 <- GAStech_edges_aggregated1
GAStech_edges_aggregated2$cancatenate <- paste(GAStech_edges_aggregated2$Weekday,"--",
                                               GAStech_edges_aggregated2$Subject) 
GAStech_edges_aggregated2$title <- iconv(GAStech_edges_aggregated2$cancatenate, to = "utf-8")
GAStech_edges_aggregated2 <- GAStech_edges_aggregated2[,c(8,30,53)]

In the given interactive network graph, we are able to drag each node and see the email subject that related to the two persons. Each of the Node are colored by the department this person belongs to. And the color of the edges can help us to distinguish which department the email is sent by. Therefore we need to focus on the edges(emails) that colored difference from the target nodes, those are emails communicating between different departments.

And if we hover the mouse onto the edges, we are able to distinguish the exact subject of sent emails. Also, since we’ve already suspect the two employees from security department whose famaly names “Vann”, we are able to zoom in and have a closer look of their email exchanges with others.

visNetwork(GAStech_nodes_new,
           GAStech_edges_aggregated2) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(arrows = "to", 
           smooth = list(enabled = TRUE, 
                         type = "curvedCW")) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

fr graph with id selection

visNetwork(GAStech_nodes_new,
           GAStech_edges_aggregated2) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

The general approach of consider all the emails between difference departments are personal can be analysed more accurate. Since the edges will only present one text for connection between two nodes, and the weight of edges shows the frequency of emailing, an approachable way is to search in the below DT table below and connect check all the emails to distinguish whether emails are offical or personal.

GAStech_edges_aggregated1$From <- iconv(GAStech_edges_aggregated1$From, to = "utf-8")
GAStech_edges_aggregated1$To <- iconv(GAStech_edges_aggregated1$To, to = "utf-8")
GAStech_edges_aggregated1$Date <- iconv(GAStech_edges_aggregated1$Date, to = "utf-8")
GAStech_edges_aggregated1$Subject <- iconv(GAStech_edges_aggregated1$Subject, to = "utf-8")
GAStech_edges_aggregated1$Weekday <- iconv(GAStech_edges_aggregated1$Weekday, to = "utf-8")
Node_present <- GAStech_edges_aggregated1[,c('From', 'To', 'Date','Subject','Weekday')]

DT::datatable(Node_present)

4. Reference

https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089

https://datastorm-open.github.io/visNetwork/edges.html